About the Project

This is my personal project page for the University of Helsinki statistics course named “Introduction to Open Data Science”.

About Me

My name is Pinja-Liina Jalkanen.

Please be forewarned that my statistics skills are rather crappy and usually limited to some simple descriptives and regressions. I do know some tech stuff quite well, though, so I don’t expect great difficulties with the R syntax, Git or the like. Kimmo Vehkalahti said in the first lecture of the course that data analysis is often 90% preparation of the data and 10% doing the analysis. I don’t know about the weighting on this course, but I usually don’t need help with the 90% at all; that’s mostly just RTFM, STFW and the like anyway.

With regard to the 10% though, I don’t trust my skills at all.

Link to this project: https://github.com/pinjaliina/IODS-project


Analysis of the Learning Questionnaire

Introduction

The data analysed in this exercise comes from the Autumn 2014 Introductory Statistics Course Learning Questionnaire. The original data has 60 variables and 183 observations; apart from a few background variables (age, attitude towards statistics, course points), most of the questions were learning-related, with answers given on a Likert scale from 1 to 5.

For the analysis, I’ve averaged the variables related to deep learning, surface learning, and strategic learning. (For the initial data wrangling part of this exercise, see this R script).
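The averaging step can be sketched roughly as follows with rowMeans(); note that the item names below are made up for illustration, and the actual question groupings are defined in the linked wrangling script:

```r
# Hypothetical sketch of the averaging; the real item names and groupings
# live in the linked data wrangling script.
answers <- data.frame(D03 = c(5, 4), D06 = c(4, 4), D07 = c(3, 5))
deep_items <- c("D03", "D06", "D07")      # assumed deep-learning items
answers$deep <- rowMeans(answers[, deep_items]) # average the Likert items
answers$deep
```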

Overview of the Data

The data is read in as follows:

# Read in the data
lrn2014a <- as.data.frame(read.table('data/learning2014.csv',  sep="\t", header=TRUE))

A scatter plot matrix of the variables of the data can be drawn as follows, coloured by the gender variable:

# ggpairs() is provided by the GGally package.
p <- ggpairs(lrn2014a, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

A summary of each of the variables of the data can be displayed as follows:

summary(lrn2014a)
##       age           attitude         points      gender 
##  Min.   :17.00   Min.   :14.00   Min.   : 7.00   F:110  
##  1st Qu.:21.00   1st Qu.:26.00   1st Qu.:19.00   M: 56  
##  Median :22.00   Median :32.00   Median :23.00          
##  Mean   :25.51   Mean   :31.43   Mean   :22.72          
##  3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:27.75          
##  Max.   :55.00   Max.   :50.00   Max.   :33.00          
##       deep            stra            surf      
##  Min.   :1.625   Min.   :1.583   Min.   :1.250  
##  1st Qu.:3.500   1st Qu.:2.417   1st Qu.:2.625  
##  Median :3.875   Median :2.833   Median :3.188  
##  Mean   :3.796   Mean   :2.787   Mean   :3.121  
##  3rd Qu.:4.250   3rd Qu.:3.167   3rd Qu.:3.625  
##  Max.   :4.875   Max.   :4.333   Max.   :5.000

As can be seen from the output, most of the variables – with the exception of the age of the students – show no obvious pattern and correlate only weakly with the course points. The most notable exception is the attitude towards statistics, which correlates more strongly with the course points than any other variable. Except for age and points, the distributions of most of the variables seem reasonably close to a normal distribution. The gender variable shows that the course was attended by considerably more female than male students.

Fitting of a Regression Model

A linear regression model can be fit to the data as follows:

lrn2014_model <- lm(points ~ attitude + stra + surf, data = lrn2014a)

The chosen explanatory variables for the model were attitude, strategic learning and surface learning. The summary of the model can be printed as follows:

summary(lrn2014_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra        -0.58607    0.80138  -0.731  0.46563    
## surf         0.85313    0.54159   1.575  0.11716    
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

However, as can be seen from the model summary, the estimates of strategic and surface learning are not statistically significant in explaining the course points; given those variables’ weak correlation with the points, this was somewhat expected. It thus makes sense to eliminate at least the variable with the highest p-value, strategic learning:

lrn2014_model <- lm(points ~ attitude + surf, data = lrn2014a)
summary(lrn2014_model)
## 
## Call:
## lm(formula = points ~ attitude + surf, data = lrn2014a)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## surf         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Analysis of the Summary

With the removal of that variable, the statistical significance of the surface learning estimate improves somewhat, to a nearly significant level, so it can be kept in the model. In practice its effect on the model is so small that it could almost as well be left out, but because it is reasonably close to statistical significance and also yields the highest adjusted R2 – the adjusted R2 of a model with attitude as the only explanatory variable, as reported by summary(), would be 0.1856 – its inclusion in the model is justified.

While the model estimates are statistically significant, the model as a whole is not very good: the multiple R2 value is only 0.20, which means that about 80 % of the variance in the dependent variable – course points – remains unexplained by the explanatory variables. Thus, any predictions based on the model alone might not be very reliable.
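The multiple R2 reported by summary() is simply 1 − SSE/SST; a self-contained sketch of that identity, using R’s built-in cars data (not the course data) so that it runs on its own:

```r
# R-squared equals 1 - SSE/SST, demonstrated on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
sse <- sum(residuals(fit)^2)                     # residual sum of squares
sst <- sum((cars$dist - mean(cars$dist))^2)      # total sum of squares
all.equal(1 - sse / sst, summary(fit)$r.squared) # TRUE
```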

Diagnostic Plots and Assumptions of the Model

In addition to linearity, linear models are fitted with the assumption that:

  1. The errors of the model are normally distributed.
  2. The errors are not correlated.
  3. The size of the errors does not depend on the explanatory variables.

The validity of these assumptions can be tested by analysing the residuals of the model. This can be done with the help of different kinds of diagnostic plots. In the following figure, three different plots are drawn:

  • A residuals vs. fitted values plot
  • A Q–Q-plot
  • A Residuals vs. leverage plot
par(mfrow = c(1,3)) # Set some graphical params.
# It seems that the drawing order of the plot is independent of the vector
# order and can't be changed.
plot(lrn2014_model, which = c(1,2,5))

Interpretation of the plots:

  1. The Q–Q-plot demonstrates that the standardised residuals of the model fit the theoretical quantiles reasonably well, so the normality assumption is valid.
  2. The residuals vs. fitted values plot doesn’t show any kind of pattern, so the errors are uncorrelated and their size is independent of the explanatory variables (their σ2 is constant).
  3. In addition to checking the abovementioned assumptions, it is recommended to check that no single observation has an oversized effect on the model, because such observations might distort the model coefficients. As the x-axis scale of the residuals vs. leverage plot is relatively narrow, with no significant outliers, we can conclude that the model is not distorted by any single observation.
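The influence check can also be done numerically with Cook’s distances; the sketch below uses the built-in cars data as a stand-in so that it runs on its own, but the same call works on lrn2014_model:

```r
# Cook's distances flag influential observations; values near or above 1
# (or far above the rest) would warrant a closer look.
fit <- lm(dist ~ speed, data = cars)  # stand-in model for illustration
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)  # the most influential observations
any(cd > 1)                           # FALSE: no dominating observation
```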

Analysis of Alcohol Consumption

Introduction

The data for the analysis in this exercise comes from Fabio Pagnotta’s and Hossain Mohammad Amran’s Using Data Mining To Predict Secondary School Student Alcohol Consumption (2008), published by the Department of Computer Science of the University of Camerino (link: https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION). For the analysis, I’ve also created a combined total alcohol consumption variable (the average of the weekday and weekend consumption variables) and a separate logical high use variable based on that average. (For the initial data wrangling part of this exercise, see this R script).
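A minimal sketch of those two derived variables (the actual code is in the linked wrangling script; the toy values below are made up):

```r
# alc_use averages the weekday (Dalc) and weekend (Walc) consumption,
# both on a 1-5 Likert scale; high_use flags averages above 2.
toy <- data.frame(Dalc = c(1, 2, 5), Walc = c(1, 3, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2
toy$high_use <- toy$alc_use > 2
toy
```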

Overview of the Data

The data is read in as follows:

# Read in the data
alc <- as.data.frame(read.table('data/alc.csv',  sep="\t", header=TRUE))

A glimpse of all of the variables of the data can be displayed as follows:

glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G...
## $ sex        <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15...
## $ address    <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, ...
## $ Pstatus    <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4...
## $ Mjob       <fctr> at_home, at_home, at_home, health, other, ser...
## $ Fjob       <fctr> teacher, other, other, services, other, other...
## $ reason     <fctr> course, course, other, home, home, reputation...
## $ nursery    <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet   <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes...
## $ guardian   <fctr> mother, father, mother, mother, father, mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup  <fctr> yes, no, yes, no, no, no, no, yes, no, no, no...
## $ famsup     <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes...
## $ paid       <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no...
## $ higher     <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic   <fctr> no, no, no, yes, no, no, no, no, no, no, no, ...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0,...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14,...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14,...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE...

Purpose of the Analysis

As can be seen from the above variable list, there are both numerical and factor-type background variables. The target variable of this analysis is the binary high/low alcohol consumption variable. To analyse it, I’ve chosen the following four variables that I assume to be indicative of students’ alcohol consumption:

  1. Absences from lessons (variable absences). I assume that students who skip lessons drink more.
  2. Going out (variable goout). I assume that students who go out more do so to drink, so they drink more.
  3. Final grade (variable G3). I assume that students with poor grades drink more.
  4. Study time (variable studytime). I assume that students who spend less time studying spend more time drinking.

A summary and some plots of the chosen variables are shown below (the boxplots’ whiskers extend to at most 1.5 times the interquartile range). I also grouped the box plots by sex to see any potential differences between the sexes:

summary(alc[c('absences','goout','G3','studytime')])
##     absences          goout             G3          studytime    
##  Min.   : 0.000   Min.   :1.000   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:2.000   1st Qu.: 8.00   1st Qu.:1.000  
##  Median : 3.000   Median :3.000   Median :11.00   Median :2.000  
##  Mean   : 5.319   Mean   :3.113   Mean   :10.39   Mean   :2.034  
##  3rd Qu.: 8.000   3rd Qu.:4.000   3rd Qu.:14.00   3rd Qu.:2.000  
##  Max.   :75.000   Max.   :5.000   Max.   :20.00   Max.   :4.000
p1 <- ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + xlab('high use')
p2 <- ggplot(alc, aes(goout)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count") + xlab('going out')
p3 <- ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab('final grade') + xlab('high use')
p4 <- ggplot(alc, aes(studytime)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count")
# The multiplot() is defined in the init section of this file. For details, see
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
multiplot(p1, p2, p3, p4, cols = 2)

Logistic Regression Analysis

Based on the above exploration of the variables, it looks like all my previously stated hypothetical assumptions were true to at least some extent, with perhaps the exception of final grade. To confirm this, I did a logistic regression analysis using my chosen variables as the explanatory variables.

In the following, a model is fitted to the data and summarised.

m <- glm(high_use ~ absences + goout + G3 + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + goout + G3 + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8166  -0.7495  -0.5034   0.8071   2.5083  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.62195    0.62727  -4.180 2.92e-05 ***
## absences     0.05600    0.01643   3.407 0.000656 ***
## goout        0.74536    0.12013   6.204 5.49e-10 ***
## G3           0.01269    0.02848   0.446 0.655854    
## studytime   -0.60004    0.16852  -3.561 0.000370 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 383.16  on 377  degrees of freedom
## AIC: 393.16
## 
## Number of Fisher Scoring iterations: 4

Judging by the model summary, my hypothesis regarding the final grade was wrong: it is not a good predictor of high alcohol consumption at all and bears no statistical significance whatsoever. All the other chosen explanatory variables are relatively strong predictors. Calculating the odds ratios points in the same direction:

or <- exp(coef(m))
or
## (Intercept)    absences       goout          G3   studytime 
##  0.07266094  1.05759534  2.10719756  1.01277150  0.54879019
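These are simply the exponentiated model coefficients; for example, the goout estimate from the summary above reproduces the printed odds ratio:

```r
# The odds ratio of goout is the exponential of its logit-scale estimate.
exp(0.74536)  # ~2.107, matching the odds ratio printed above
```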

As shown by the odds ratios, a student is virtually equally likely to consume a lot of alcohol regardless of the grade. Interestingly, the absences are not much of a factor either: each additional absence increases the odds of high consumption only about 1.06-fold. With regard to going out and study time the results are, however, very clear: each one-point increase in going out roughly doubles the odds of drinking a lot, and, correspondingly, each one-point increase in study time roughly halves them. But while absences don’t increase the likelihood of high use much, comparing their odds ratio to its confidence interval still confirms their statistical significance:

ci <- exp(confint(m))
cbind(or, ci)
##                     or      2.5 %    97.5 %
## (Intercept) 0.07266094 0.02042166 0.2402599
## absences    1.05759534 1.02602413 1.0954317
## goout       2.10719756 1.67554867 2.6863511
## G3          1.01277150 0.95846241 1.0720167
## studytime   0.54879019 0.38995929 0.7564410

As shown by the confidence intervals of the odds ratios, the intervals of all the explanatory variables except the final grade (G3) steer well clear of one, which means it is unlikely that their true odds ratios are actually one. With regard to the final grade, however, my initial hypothesis was clearly wrong, because the value one lies almost in the middle of its confidence interval. Thus, before making any actual predictions with the model, it’s best to refit it without that variable:

m <- glm(high_use ~ absences + goout + studytime, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ absences + goout + studytime, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8376  -0.7530  -0.4927   0.8091   2.4967  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.47603    0.53168  -4.657 3.21e-06 ***
## absences     0.05627    0.01651   3.409 0.000651 ***
## goout        0.73711    0.11835   6.228 4.72e-10 ***
## studytime   -0.59422    0.16800  -3.537 0.000405 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 383.36  on 378  degrees of freedom
## AIC: 391.36
## 
## Number of Fisher Scoring iterations: 4
cbind(exp(coef(m)), exp(confint(m)))
##                             2.5 %    97.5 %
## (Intercept) 0.08407616 0.02874986 0.2322992
## absences    1.05788483 1.02622976 1.0958710
## goout       2.08988646 1.66701101 2.6539672
## studytime   0.55199236 0.39259400 0.7600378

As shown by the above statistics, the remaining explanatory variables are now all statistically highly significant and have confidence intervals well clear of the value one, so the model is now ready to be used for predictions.

Predicting with the Model

The model can be used for predictions as follows:

# Predict the probability.
probabilities <- predict(m, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability = probabilities)
# Calculate a logical high use value based on the probabilities.
alc <- mutate(alc, prediction = probability > 0.5)
# Tabulate the target variable versus the predictions,
# with both absolute and proportional numbers.
tbl <- table(high_use = alc$high_use, prediction = alc$prediction)
addmargins(tbl)
##         prediction
## high_use FALSE TRUE Sum
##    FALSE   246   24 270
##    TRUE     66   46 112
##    Sum     312   70 382
round(addmargins(prop.table(tbl)), 2)
##         prediction
## high_use FALSE TRUE  Sum
##    FALSE  0.64 0.06 0.71
##    TRUE   0.17 0.12 0.29
##    Sum    0.82 0.18 1.00
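The overall accuracy implied by the table can be computed directly from its cells (the numbers below are copied from the absolute table above):

```r
# Correct predictions lie on the diagonal of the confusion matrix.
tbl <- matrix(c(246, 66, 24, 46), nrow = 2,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy <- sum(diag(tbl)) / sum(tbl)
round(accuracy, 2)  # 0.76
```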

As the tables show, the model is too cautious in its predictions; it predicts fewer occurrences of high use than the actual data shows. This can also be demonstrated graphically:

hu <- as.data.frame(prop.table(table(alc$high_use)))
pred <- as.data.frame(prop.table(table(alc$prediction)))
pp1 <- ggplot(hu, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('observed high use') + theme(legend.position = 'none')
pp2 <- ggplot(pred, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('predicted high use') + theme(legend.position = 'none')
multiplot(pp1, pp2, cols = 2)

The actual model training error can be calculated as follows (note that this is a function only because one is needed later on for cv.glm()):

mloss <- function(obs, prob) {
  res <- ifelse(prob > 0.5, 1, 0)
  mean(res != obs)
}
round(mloss(obs = alc$high_use, prob = alc$probability), 2)
## [1] 0.24

The training error is 24%, so the model has a little over 75% accuracy. This isn’t perfect, but likely still better than any simple guessing strategy – after all, by guessing alone I wasn’t even able to predict my chosen variables’ statistical significance correctly.
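For comparison, a naive strategy that always predicts “not high use” would misclassify every actual high user; with the counts from the confusion table above, its error would be:

```r
# Baseline error of always guessing FALSE: the share of actual high users.
n_high  <- 112   # TRUE row sum from the confusion table above
n_total <- 382
round(n_high / n_total, 2)  # 0.29, worse than the model's 0.24
```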

Cross-validation (bonus task)

To test the model further, cross-validation can be performed. The following performs a 10-fold cross-validation:

cv <- cv.glm(data = alc, cost = mloss, glmfit = m, K = 10)
cv$delta[1] # Print the average proportion of wrong predictions.
## [1] 0.2382199

The average prediction error from the 10-fold cross-validation is 0.24, which is already better performance than that of the model introduced in the DataCamp exercise. However, the model could be improved further by adding more variables. This model does not include sex as a variable, although according to the DataCamp exercise it is statistically significant. Thus, we can try adding it and re-run the cross-validation to further improve the model:

m2 <- glm(high_use ~ absences + goout + studytime + sex, data = alc, family = "binomial")
summary(m2) # Summarise the model.
## 
## Call:
## glm(formula = high_use ~ absences + goout + studytime + sex, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7586  -0.7653  -0.4897   0.7254   2.5996  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.23486    0.59934  -5.397 6.76e-08 ***
## absences     0.06269    0.01687   3.716 0.000202 ***
## goout        0.74151    0.12061   6.148 7.86e-10 ***
## studytime   -0.45081    0.17155  -2.628 0.008594 ** 
## sexM         0.81167    0.26937   3.013 0.002585 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 374.08  on 377  degrees of freedom
## AIC: 384.08
## 
## Number of Fisher Scoring iterations: 4
cbind(exp(coef(m2)), exp(confint(m2))) # Odds ratios and their confidence intervals.
##                            2.5 %    97.5 %
## (Intercept) 0.0393658 0.01172067 0.1236057
## absences    1.0646959 1.03217181 1.1038908
## goout       2.0990949 1.66738691 2.6783765
## studytime   0.6371102 0.45038109 0.8845879
## sexM        2.2516753 1.33375527 3.8439285
# Predict the probability.
probabilities2 <- predict(m2, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability2 = probabilities2)
# Calculate a logical high use value based on the probabilities.
alc <- mutate(alc, prediction2 = probability2 > 0.5)
# Re-run cross-validation and print the average proportion of wrong predictions.
cv2 <- cv.glm(data = alc, cost = mloss, glmfit = m2, K = 10)
cv2$delta[1]
## [1] 0.2094241

The average prediction error is now only 0.21, clearly lower than that of the previous model.


Classification of the Boston Dataset

Introduction

The data of this classification and clustering exercise was the Housing Values in Suburbs of Boston dataset, henceforth referred to simply as Boston. It is available from the R package MASS, which provides functions and datasets to support Venables and Ripley, “Modern Applied Statistics with S” (4th edition, 2002). According to the reference manual of the package, the dataset includes 506 observations of the following 14 variables:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000 × (Bk − 0.63)^2, where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1,000s.

According to the reference manual, the data is based on the following sources:

  • Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
  • Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Overview of the Data

After the MASS package has been loaded – by calling library(MASS) – its built-in datasets can be prepared simply by passing their names to data(). This makes them accessible by name. They are, however, lazy-loaded, and thus only become actual data frames when they’re accessed for the first time (see help("data") for details):

# Prepare the data. (Quotes are optional but recommended; see help("data").)
data("Boston")

Glimpse the data to confirm it matches the reference manual:

# The data is only turned into an actual data frame at this point.
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, ...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, ...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, ...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 1...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 1...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 3...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15,...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 1...

Explore the data graphically:

# Subplot axis labels are partially too cramped, but I failed to find a working solution for that.
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 10)), upper = list(continuous = wrap("cor", size=3)))
p

Show variable summaries:

summary(Boston)
##       crim                zn             indus      
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19  
##  Median : 0.25651   Median :  0.00   Median : 9.69  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74  
##       chas              nox               rm             age        
##  Min.   :0.00000   Min.   :0.3850   Min.   :3.561   Min.   :  2.90  
##  1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02  
##  Median :0.00000   Median :0.5380   Median :6.208   Median : 77.50  
##  Mean   :0.06917   Mean   :0.5547   Mean   :6.285   Mean   : 68.57  
##  3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08  
##  Max.   :1.00000   Max.   :0.8710   Max.   :8.780   Max.   :100.00  
##       dis              rad              tax           ptratio     
##  Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60  
##  1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40  
##  Median : 3.207   Median : 5.000   Median :330.0   Median :19.05  
##  Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46  
##  3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20  
##  Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00  
##      black            lstat            medv      
##  Min.   :  0.32   Min.   : 1.73   Min.   : 5.00  
##  1st Qu.:375.38   1st Qu.: 6.95   1st Qu.:17.02  
##  Median :391.44   Median :11.36   Median :21.20  
##  Mean   :356.67   Mean   :12.65   Mean   :22.53  
##  3rd Qu.:396.23   3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :396.90   Max.   :37.97   Max.   :50.00

As can be seen from the output, almost all of the variables are continuous, with the exception of the Charles River dummy variable (chas), which is binary, and the radial highway accessibility variable (rad), which is an index but still measured on an interval scale.

The distributions of most variables are rather skewed, except for the dwelling size (number of rooms; rm), which is normally distributed, and the median value of owner-occupied homes (medv), which is nearly so. Some variables are very strongly skewed, such as the crime rate (crim), the proportion of land zoned for very large lots (zn) and the proportion of black people (black).

There are a lot of variables in the data, so it might be helpful to calculate a correlation matrix of the data and visualise it:

# Calculate the correlation matrix and round it (%>% is the pipe from dplyr/magrittr).
cor_matrix <- cor(Boston) %>% round(digits = 2)
# Visualise the correlation matrix; corrplot() is from the corrplot package.
corrplot(cor_matrix, method = "circle", type = "upper", tl.pos = "d", tl.cex = 0.6)

A numerical equivalent of a correlation plot including only the highest correlations might be created as follows. I personally found this to be the most intuitive way to find the highest correlations:

cb <- as.data.frame(cor(Boston)) # Create a DF of the correlation matrix.
cor_matrix_high <- as.data.frame(matrix(nrow = 14, ncol = 14)) #Copy
colnames(cor_matrix_high) <- colnames(cor_matrix) #the structure of
rownames(cor_matrix_high) <- rownames(cor_matrix) #cor_matrix.
cor_threshold <- 0.7
# Loop through the correlation matrix and save only values that exceed the threshold.
for(col in names(cb)) {
  for(row in 1:length(cb[[col]])) {
    if(abs(cb[[col,row]]) > cor_threshold & abs(cb[[col,row]]) < 1) { 
      cor_matrix_high[col,as.character(rownames(cb)[row])] <- round(cb[[col,row]], digits = 2)
    }
  }
}
# Print the matrix.
cor_matrix_high
##         crim zn indus chas   nox rm   age   dis  rad  tax ptratio
## crim      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## zn        NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## indus     NA NA    NA   NA  0.76 NA    NA -0.71   NA 0.72      NA
## chas      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## nox       NA NA  0.76   NA    NA NA  0.73 -0.77   NA   NA      NA
## rm        NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## age       NA NA    NA   NA  0.73 NA    NA -0.75   NA   NA      NA
## dis       NA NA -0.71   NA -0.77 NA -0.75    NA   NA   NA      NA
## rad       NA NA    NA   NA    NA NA    NA    NA   NA 0.91      NA
## tax       NA NA  0.72   NA    NA NA    NA    NA 0.91   NA      NA
## ptratio   NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## black     NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## lstat     NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
## medv      NA NA    NA   NA    NA NA    NA    NA   NA   NA      NA
##         black lstat  medv
## crim       NA    NA    NA
## zn         NA    NA    NA
## indus      NA    NA    NA
## chas       NA    NA    NA
## nox        NA    NA    NA
## rm         NA    NA    NA
## age        NA    NA    NA
## dis        NA    NA    NA
## rad        NA    NA    NA
## tax        NA    NA    NA
## ptratio    NA    NA    NA
## black      NA    NA    NA
## lstat      NA    NA -0.74
## medv       NA -0.74    NA

Variables tax (property taxes) and rad have a remarkably high correlation with each other: 0.91. This might mean that the taxation is at least partially based on the highway accessibility. Other relatively high correlations include e.g. the negative correlation between the amount of nitrogen oxides (nox) and the weighted distance to the employment centres (dis) (−0.77), the positive correlation between industry presence (indus) and nox (0.76), and the negative correlation between the proportion of pre-1940 buildings (age) and dis (−0.75).
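As a loop-free alternative sketch, the same high-correlation pairs can be extracted with which(..., arr.ind = TRUE); the example below uses the built-in mtcars data as a stand-in so that it runs on its own, but substituting cor(Boston) gives the same result as the loop above:

```r
# List each variable pair whose absolute correlation exceeds a threshold.
cm <- cor(mtcars)                     # stand-in for cor(Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA  # keep each pair only once
hits <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[hits[, "row"]],
           var2 = colnames(cm)[hits[, "col"]],
           cor  = round(cm[hits], 2))
```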

Standardising and Categorising the Data

To further analyse the dataset, it must be standardised, i.e. every variable centred and scaled so that its mean is zero and its standard deviation is one. This can be done as follows:

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515  
##       age               dis               rad         
##  Min.   :-2.3331   Min.   :-1.2658   Min.   :-0.9819  
##  1st Qu.:-0.8366   1st Qu.:-0.8049   1st Qu.:-0.6373  
##  Median : 0.3171   Median :-0.2790   Median :-0.5225  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.9059   3rd Qu.: 0.6617   3rd Qu.: 1.6596  
##  Max.   : 1.1164   Max.   : 3.9566   Max.   : 1.6596  
##       tax             ptratio            black        
##  Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We can now see from the output of summary() that this worked as intended. To use our target variable – crim – for classification, we also need to categorise it:

# Create a quantile vector of crim, and use it to create the categorical "crime".
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c('low','med_low','med_high','high'))
# Replace the continuous crim variable with the categorical crime variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime) # Explore the categorised variable.
## 
##      low  med_low med_high     high 
##      127      126      126      127

Dividing the Data and Fitting the Model

To create the LDA model and to test it, the data has to be divided into training and testing sets. This can be done as follows, randomly choosing 80% of the rows for training:

n <- nrow(boston_scaled) # Get number of rows in the dataset.
ind <- sample(n,  size = n * 0.8) # Choose randomly 80% of the rows.
train <- boston_scaled[ind,] # Create train set.
test <- boston_scaled[-ind,] # Create test set.
# Save the correct classes from the test data.
correct_classes <- test$crime
# Remove the crime variable from the test data.
test <- dplyr::select(test, -crime)

With the data divided, it is now possible to fit the LDA model on the training set:

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2351485 0.2623762 0.2450495 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       0.9310036 -0.8997841 -0.19661560 -0.8657659  0.39225629
## med_low  -0.1207535 -0.2173649 -0.02367011 -0.5403276 -0.17901891
## med_high -0.3893509  0.1798569  0.21052285  0.4044867  0.09697913
## high     -0.4872402  1.0171737 -0.07348562  1.0792632 -0.40302139
##                 age        dis        rad        tax     ptratio
## low      -0.8783624  0.8938697 -0.6826072 -0.7189505 -0.43026262
## med_low  -0.2498804  0.3086852 -0.5478716 -0.4335468  0.01373224
## med_high  0.4321568 -0.3845833 -0.3924693 -0.3093288 -0.32719705
## high      0.8306075 -0.8588203  1.6375616  1.5136504  0.78011702
##               black       lstat        medv
## low       0.3734192 -0.74401589  0.46804882
## med_low   0.3039552 -0.08346247 -0.04934803
## med_high  0.0941871  0.04170050  0.16640034
## high     -0.8108170  0.94270106 -0.73963535
## 
## Coefficients of linear discriminants:
##                   LD1         LD2        LD3
## zn       0.0955965246  0.70624698 -0.8239310
## indus   -0.0358311928 -0.18625688  0.3297031
## chas    -0.0856091798 -0.13405834  0.0568785
## nox      0.3437240310 -0.71040440 -1.3714271
## rm      -0.0994877919 -0.09962215 -0.2056183
## age      0.3289055211 -0.36126888  0.0388215
## dis     -0.0681938096 -0.28896012  0.2105911
## rad      3.0362894951  0.91854428 -0.2671302
## tax      0.0005978297 -0.02971497  0.7098499
## ptratio  0.0888520020 -0.02644323 -0.1454231
## black   -0.1275380985  0.02144684  0.1211509
## lstat    0.2261932426 -0.22638028  0.3334836
## medv     0.1500272939 -0.41970832 -0.1130043
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9441 0.0420 0.0138

As we used quantiles to categorise the original variable, we have four classes. Thus, the output shows three linear discriminants, as expected (one fewer than the number of classes). Of these, the first explains the vast majority – 94% – of the between-group variance.
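The "proportion of trace" line can also be computed directly from the singular values that MASS::lda() stores in the fitted object. A self-contained sketch using the built-in iris data (applying the same formula to lda.fit above should reproduce the 0.9441 / 0.0420 / 0.0138 figures):

```r
library(MASS)
# The "Proportion of trace" printed by lda() is the squared singular
# values of the discriminants, normalised to sum to one.
fit <- lda(Species ~ ., data = iris)
round(fit$svd^2 / sum(fit$svd^2), 4)
```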

The first two of the model’s linear discriminants can be visualised as follows. A helper function is needed to draw the arrows in the biplot:

# Define a function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime) # Turn the classes to numeric for plotting.
plot(lda.fit, dimen = 2, col = classes, pch = classes) # Plot.
lda.arrows(lda.fit) # Add arrows.

It’s possible to visualise all three discriminants, but the lda.arrows() function is incompatible with that:

plot(lda.fit, dimen = 3, col = classes, pch = classes) # Plot.

Predicting with the Model

We’ve already prepared the test set above, so it’s now possible to move straight into predictions:

lda.pred <- predict(lda.fit, newdata = test) # Predict the test values.
# Cross tabulate the predictions with the correct values.
addmargins(table(correct = correct_classes, predicted = lda.pred$class))
##           predicted
## correct    low med_low med_high high Sum
##   low       17       6        0    0  23
##   med_low   11      15        5    0  31
##   med_high   1       6       13    0  20
##   high       0       0        0   28  28
##   Sum       29      27       18   28 102

(I used addmargins() when tabulating because, in my opinion, the margins make the table more illustrative and help comparisons.) As seen from the table, the model predicted the highest crime rates reliably, but the “med_low” category is overrepresented relative to the “low” and “med_high” categories. Thus, the model can be used to make crude predictions, but it’s hardly perfect. It might be better to use an unsupervised method and cluster the data instead of classifying it.
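For a single summary figure, the overall accuracy can be computed from the same objects; a small sketch (from the table above, 17 + 15 + 13 + 28 = 73 of the 102 test observations, i.e. roughly 72%, were classified correctly):

```r
# Share of test-set observations whose crime class was predicted correctly.
mean(correct_classes == lda.pred$class)
# Equivalently, the diagonal of the confusion matrix over its total:
conf <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(conf)) / sum(conf)
```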

Clustering the Data

To cluster the data, it needs to be loaded and standardised again, and a Euclidean distance matrix created from it:

boston_scaled <- as.data.frame(scale(Boston)) # Standardise the data.
dist_eu <- dist(boston_scaled) # Create a Euclidean distance matrix.
summary(dist_eu) # Summarise the matrix.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4620  4.8240  4.9110  6.1860 14.4000

We can try to cluster the data with k-means straight away. We used four classes for our LDA model, so we might try the same number of clusters:

km <- kmeans(dist_eu, centers = 4) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.

However, while the results look somewhat reasonable, the number of clusters was merely a guess. To determine it properly, the total within-cluster sum of squares (TWCSS) should be examined as a function of the number of clusters. Let’s try it with a maximum of 15 clusters:

k_max <- 15 # Maximum number of clusters to try.
# Define a function for testing.
k_try <- function(k) {
  kmeans(dist_eu, k)$tot.withinss
}
# Calculate the total within sum of squares using the function.
twcss <- sapply(1:k_max, k_try)

# Visualize the results.
plot(1:k_max, twcss, type='b')

The optimal number of clusters is where the TWCSS drops radically; however, by inspecting the above plot, it is somewhat debatable whether this happens with two or with four clusters. Thus, for comparison, let’s re-cluster the data with just two clusters:

km <- kmeans(dist_eu, centers = 2) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.

As the plots above demonstrate, there seems to be less overlap between the clusters than with four clusters, which suggests that, at least when using Euclidean distance, the optimal number of clusters is indeed two. Because different distance measures may produce different results, I also briefly tested my code by creating the dist_eu variable with the Manhattan distance method instead, but found the results in this case so similar to the Euclidean ones that they are not worth repeating here. The most notable difference was that with the Manhattan method, the TWCSS plot hinted even more strongly that the optimal number of clusters is two.
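The Manhattan comparison mentioned above is a one-line change to the distance call; a sketch mirroring the Euclidean steps:

```r
dist_man <- dist(boston_scaled, method = "manhattan") # Manhattan distance matrix.
km_man <- kmeans(dist_man, centers = 2) # Re-cluster with two clusters.
pairs(boston_scaled, col = km_man$cluster) # Plot for comparison.
```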


Tea and Human Development

Introduction

This chapter has two distinct parts: in the first part, a Principal Component Analysis (PCA) is performed on the combined data of the year 2015 Human Development Index (HDI) and Gender Inequality Index (GII) of the United Nations. In the second part, a Multiple Correspondence Analysis (MCA) is performed on the “tea” dataset included in the FactoMineR R library.

The Human Data

The combined HDI + GII data – henceforth referred to as the Human data – was prepared by joining the two datasets by country, calculating two additional ratio variables from the existing ones, excluding variables that were deemed irrelevant for the analysis, and leaving out any records that were incomplete or that referred to geographic areas other than countries. (For the data wrangling script that was used to prepare the data, see here).

The data is read in as follows:

# Read in the data
human <- as.data.frame(read.table('data/human.csv',  sep="\t", header=TRUE))

Glimpse the data to explore its structure and dimensions:

glimpse(human)
## Observations: 155
## Variables: 8
## $ se_f_of_m  <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0....
## $ lfp_f_of_m <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0....
## $ edu_exp    <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5...
## $ life_exp   <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1...
## $ gni_cap    <int> 64992, 42261, 56431, 44025, 45435, 43919, 3956...
## $ mmr        <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27...
## $ abr        <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5...
## $ mp_share   <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4...

As seen from the output, the dataset has eight variables and includes data for 155 countries. The included variables may be described as follows:

  • se_f_of_m = Share of female population with secondary education / Share of male population with secondary education
  • lfp_f_of_m = Share of female population that participates in labour force / Share of male population that participates in labour force
  • edu_exp = Expected years of education
  • life_exp = Life expectancy at birth
  • gni_cap = Gross national income (GNI) per capita (dollars, purchasing power parity)
  • mmr = Maternal mortality rate
  • abr = Adolescent birth rate
  • mp_share = Share of female representatives in the national parliament

Overview of the Human Data

A graphical overview of the Human data and summaries of its variables can be displayed as follows:

ggpairs(human)

summary(human)
##    se_f_of_m        lfp_f_of_m        edu_exp         life_exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##     gni_cap            mmr              abr            mp_share    
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

As seen from the plots, the distributions of most of the variables are relatively skewed; the most notable exceptions are the gender ratio of the population with secondary education (se_f_of_m) and the expected years of education (edu_exp), which are reasonably close to the normal distribution. The min and max values and the scatter plots both show that some variables also have quite significant outliers, e.g. GNI per capita and the maternal mortality rate.
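The skewness impression can be quantified without extra packages using the moment-based estimator; a minimal sketch (the skew() helper here is ad hoc, not part of base R):

```r
# Sample skewness: third central moment divided by the cubed standard
# deviation; roughly zero for a symmetric distribution.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
round(sort(sapply(human, skew)), 2)
```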

The correlation figures also show that some variables correlate highly with each other. To spot such high correlation more easily, it’s useful to draw a correlation diagram:

corrplot(round(cor(human), digits = 2), method = "circle", type = "upper", tl.pos = "d", tl.cex = 0.8)

The plot demonstrates clearly that the strongest correlation of all is the negative correlation between life expectancy and the maternal mortality rate; from the previous plot we can see that its value is -0.857. The next strongest are the positive correlations between life expectancy and expected years of education (0.789) and between the maternal mortality rate and the adolescent birth rate (0.759).
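The same strongest pairs can be listed programmatically instead of reading them off the plot; a sketch:

```r
cm <- cor(human)
cm[upper.tri(cm, diag = TRUE)] <- NA # Keep each variable pair only once.
idx <- which(abs(cm) > 0.7, arr.ind = TRUE) # Pairs with |r| > 0.7.
data.frame(var1 = rownames(cm)[idx[, 1]], var2 = colnames(cm)[idx[, 2]],
           r = round(cm[idx], 3))
```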

PCA of the Human Data

To lower the dimensionality of the data, a principal component analysis (PCA) can be performed for it. This can be done, summarised and plotted as follows, using the singular value decomposition (SVD) method:

pca_human <- prcomp(human)
s_human_nonstd <- summary(pca_human)
s_human_nonstd
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000
##                           PC7    PC8
## Standard deviation     0.1912 0.1591
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000

However, as shown by the summary, when done this way the first principal component explains virtually all of the variance, and plotting doesn’t make much sense either: R actually throws some warnings (deliberately displayed here) because it is unable to draw the plot arrows properly, and the resulting plot is hard to interpret:

# Percentages of variance for the plot titles.
pr_shns <- round(100*s_human_nonstd$importance[2, ], digits = 1)
pc_shns_lab <- paste0(names(pr_shns), " (", pr_shns, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_shns_lab[1], ylab = pc_shns_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
## length = arrow.len): zero-length arrow is of indeterminate angle and
## so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
## length = arrow.len): zero-length arrow is of indeterminate angle and
## so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
## length = arrow.len): zero-length arrow is of indeterminate angle and
## so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
## length = arrow.len): zero-length arrow is of indeterminate angle and
## so skipped
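A quick look at the raw variances shows the cause: gni_cap, measured in dollars, has a variance orders of magnitude larger than every other variable, so an unstandardised PCA aligns its first component almost entirely with it. A sketch:

```r
# Variances of the unstandardised variables, largest first; gni_cap dwarfs
# all of the others.
sort(apply(human, 2, var), decreasing = TRUE)
```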

This is because PCA is sensitive to the relative scaling of the original features and treats features with larger variance as more important than features with smaller variance. Thus, unless the variables are already on comparable scales, it is important to standardise them first – and in this case, the scales of the variables differed wildly. After standardising the variables, the results look very different indeed:

human_std <- scale(human) # Standardise the variables.
pca_human_std <- prcomp(human_std)
s_human_std <- summary(pca_human_std)
s_human_std
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# Percentages of variance for the plot titles.
pr_shs <- round(100*s_human_std$importance[2, ], digits = 1)
pc_shs_lab <- paste0(names(pr_shs), " (", pr_shs, "%)")
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_shs_lab[1], ylab = pc_shs_lab[2])

Interpreting the PCA Results

The results now actually make sense: as can be read from both the summary of the analysis and the biplot, the first principal component explains 53.6% of the total variance of the original eight variables, and the second principal component explains 16.2% of it.

By inspecting the plot, we can see that the first principal component represents variables related mostly to poverty and the second principal component variables related mostly to gender equality, and that these two don’t correlate with each other.

We can thus say that poverty explains most of the total variance in the data, but equality also explains some of it. From the plot arrows, we can see that a high maternal mortality rate and a high adolescent birth rate correlate strongly with poverty, and that high life expectancy, high expected years of education, a high ratio of females with secondary education and high GNI per capita correlate strongly negatively with it. Further, a high share of female MPs and a high ratio of female labour-force participation correlate strongly positively with equality, even though equality explains much less of the total variability in the data than poverty.
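This interpretation can be cross-checked against the component loadings stored in the rotation matrix of the prcomp object; a sketch:

```r
# Loadings of the first two principal components; large absolute values
# indicate which original variables drive each component.
round(pca_human_std$rotation[, 1:2], 2)
```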

And Now it’s Teatime!

As the final part of Exercise 5, a Multiple Correspondence Analysis (MCA) was done for the “tea” dataset included in the FactoMineR R library. The package description explains the data as follows: ”We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).”

Let’s load the data and glimpse it to explore its structure and dimensions:

data("tea")
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast        <fctr> breakfast, breakfast, Not.breakfast, No...
## $ tea.time         <fctr> Not.tea time, Not.tea time, tea time, N...
## $ evening          <fctr> Not.evening, Not.evening, evening, Not....
## $ lunch            <fctr> Not.lunch, Not.lunch, Not.lunch, Not.lu...
## $ dinner           <fctr> Not.dinner, Not.dinner, dinner, dinner,...
## $ always           <fctr> Not.always, Not.always, Not.always, Not...
## $ home             <fctr> home, home, home, home, home, home, hom...
## $ work             <fctr> Not.work, Not.work, work, Not.work, Not...
## $ tearoom          <fctr> Not.tearoom, Not.tearoom, Not.tearoom, ...
## $ friends          <fctr> Not.friends, Not.friends, friends, Not....
## $ resto            <fctr> Not.resto, Not.resto, resto, Not.resto,...
## $ pub              <fctr> Not.pub, Not.pub, Not.pub, Not.pub, Not...
## $ Tea              <fctr> black, black, Earl Grey, Earl Grey, Ear...
## $ How              <fctr> alone, milk, alone, alone, alone, alone...
## $ sugar            <fctr> sugar, No.sugar, No.sugar, sugar, No.su...
## $ how              <fctr> tea bag, tea bag, tea bag, tea bag, tea...
## $ where            <fctr> chain store, chain store, chain store, ...
## $ price            <fctr> p_unknown, p_variable, p_variable, p_va...
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, ...
## $ sex              <fctr> M, F, F, M, M, M, M, F, M, M, M, M, M, ...
## $ SPC              <fctr> middle, middle, other worker, student, ...
## $ Sport            <fctr> sportsman, sportsman, sportsman, Not.sp...
## $ age_Q            <fctr> 35-44, 45-59, 45-59, 15-24, 45-59, 15-2...
## $ frequency        <fctr> 1/day, 1/day, +2/day, 1/day, +2/day, 1/...
## $ escape.exoticism <fctr> Not.escape-exoticism, escape-exoticism,...
## $ spirituality     <fctr> Not.spirituality, Not.spirituality, Not...
## $ healthy          <fctr> healthy, healthy, healthy, healthy, Not...
## $ diuretic         <fctr> Not.diuretic, diuretic, diuretic, Not.d...
## $ friendliness     <fctr> Not.friendliness, Not.friendliness, fri...
## $ iron.absorption  <fctr> Not.iron absorption, Not.iron absorptio...
## $ feminine         <fctr> Not.feminine, Not.feminine, Not.feminin...
## $ sophisticated    <fctr> Not.sophisticated, Not.sophisticated, N...
## $ slimming         <fctr> No.slimming, No.slimming, No.slimming, ...
## $ exciting         <fctr> No.exciting, exciting, No.exciting, No....
## $ relaxing         <fctr> No.relaxing, No.relaxing, relaxing, rel...
## $ effect.on.health <fctr> No.effect on health, No.effect on healt...

We can see that except for age, all of the variables are categorical – many of them actually binary – and there are 36 of them in total. It’s difficult to visualise or analyse the whole of such a large dataset at once; I actually tried to call ggpairs() on the whole dataset once, but R simply failed to create the temporary image file required to display the plot. Thus, I’m subsetting the data – creating a new dataset called chai in the process – and picking variables that tell what kind of tea people drink, whether they drink it in the evening, and whether they drink it with their friends:

# Etymology note: ”tea” is of Fujian, ”chai” of Cantonese origin. They
# both mean the same. For details, see http://wals.info/chapter/138
chai <- dplyr::select(tea, one_of(c('Tea','How','sugar','how','evening','friends')))
# Rename some columns for clarity.
names(chai)[1] <- 'type'
names(chai)[2] <- 'extras'
names(chai)[4] <- 'packaging'
glimpse(chai)
## Observations: 300
## Variables: 6
## $ type      <fctr> black, black, Earl Grey, Earl Grey, Earl Grey,...
## $ extras    <fctr> alone, milk, alone, alone, alone, alone, alone...
## $ sugar     <fctr> sugar, No.sugar, No.sugar, sugar, No.sugar, No...
## $ packaging <fctr> tea bag, tea bag, tea bag, tea bag, tea bag, t...
## $ evening   <fctr> Not.evening, Not.evening, evening, Not.evening...
## $ friends   <fctr> Not.friends, Not.friends, friends, Not.friends...

It’s now more feasible to plot the remaining variables:

ggplot(gather(chai), aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Then, let’s do an MCA for the data and summarise the model:

chai_mca <- MCA(chai, graph = FALSE)
summary(chai_mca)
## 
## Call:
## MCA(X = chai, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.240   0.218   0.196   0.181   0.164   0.161
## % of var.             14.378  13.051  11.757  10.856   9.849   9.643
## Cumulative % of var.  14.378  27.429  39.185  50.041  59.890  69.533
##                        Dim.7   Dim.8   Dim.9  Dim.10
## Variance               0.150   0.131   0.128   0.099
## % of var.              8.974   7.871   7.652   5.970
## Cumulative % of var.  78.507  86.378  94.030 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1         |  0.553  0.426  0.235 | -0.309  0.146  0.073 |  0.149
## 2         |  0.936  1.218  0.481 | -0.238  0.087  0.031 |  0.587
## 3         | -0.291  0.117  0.097 |  0.124  0.024  0.018 | -0.214
## 4         |  0.122  0.021  0.017 | -0.720  0.794  0.583 | -0.090
## 5         |  0.141  0.027  0.018 | -0.163  0.041  0.024 | -0.280
## 6         |  0.466  0.302  0.250 | -0.410  0.258  0.194 | -0.133
## 7         |  0.035  0.002  0.002 | -0.123  0.023  0.024 | -0.067
## 8         |  0.611  0.519  0.182 |  0.009  0.000  0.000 |  0.439
## 9         |  0.370  0.191  0.083 | -0.204  0.064  0.025 |  0.343
## 10        |  0.437  0.266  0.109 |  0.693  0.735  0.274 | -0.046
##              ctr   cos2  
## 1          0.038  0.017 |
## 2          0.586  0.189 |
## 3          0.078  0.053 |
## 4          0.014  0.009 |
## 5          0.134  0.072 |
## 6          0.030  0.020 |
## 7          0.008  0.007 |
## 8          0.328  0.094 |
## 9          0.201  0.072 |
## 10         0.004  0.001 |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black     |   0.802  11.027   0.210   7.933 |   0.857  13.878   0.240
## Earl Grey |  -0.465   9.667   0.390 -10.794 |  -0.292   4.209   0.154
## green     |   0.921   6.483   0.105   5.596 |  -0.213   0.381   0.006
## alone     |   0.097   0.422   0.017   2.277 |   0.072   0.256   0.010
## lemon     |  -1.157  10.244   0.165  -7.034 |   0.274   0.633   0.009
## milk      |   0.211   0.649   0.012   1.879 |  -0.596   5.717   0.094
## other     |   0.674   0.946   0.014   2.048 |   1.614   5.992   0.081
## No.sugar  |   0.488   8.545   0.254   8.718 |   0.419   6.946   0.188
## sugar     |  -0.521   9.135   0.254  -8.718 |  -0.448   7.425   0.188
## tea bag   |   0.093   0.344   0.011   1.847 |  -0.584  14.813   0.446
##            v.test     Dim.3     ctr    cos2  v.test  
## black       8.479 |   0.619   8.040   0.125   6.125 |
## Earl Grey  -6.786 |  -0.016   0.014   0.000  -0.371 |
## green      -1.292 |  -1.295  15.684   0.207  -7.871 |
## alone       1.689 |  -0.464  11.897   0.400 -10.931 |
## lemon       1.666 |   0.213   0.423   0.006   1.292 |
## milk       -5.314 |   0.813  11.801   0.176   7.247 |
## other       4.910 |   3.582  32.731   0.397  10.891 |
## No.sugar    7.489 |  -0.055   0.134   0.003  -0.987 |
## sugar      -7.489 |   0.059   0.143   0.003   0.987 |
## tea bag   -11.550 |   0.162   1.271   0.034   3.211 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## type      | 0.391 0.241 0.279 |
## extras    | 0.176 0.164 0.668 |
## sugar     | 0.254 0.188 0.003 |
## packaging | 0.048 0.458 0.183 |
## evening   | 0.206 0.108 0.035 |
## friends   | 0.363 0.146 0.007 |

What can be seen from the output of the model summary is that none of the dimensions retains a large percentage of the total variance, and that there are no strong links between the dimensions and the categorical variables, even though some of the categories do have coordinate values significantly different from zero. Drawing a variable biplot of the analysis confirms these findings:

par(mfrow = c(1,3)) # Set some graphical params.
plot(chai_mca, choix = "var", title = "MCA variables") # The variable biplot.
plot(chai_mca, choix = "ind", invisible = "var") # The individuals plot.
plot(chai_mca, choix = "ind", invisible = "ind") # The categories plot.

As can be seen from the plot on the left, none of the variables are very strongly linked to either of the first two dimensions. The strongest is the link between the packaging variable – i.e. whether the person prefers to drink loose tea, tea bags or both – and the second dimension. The rest of the links are quite weak.

The plots in the centre and on the right represent the individuals and the categories, respectively, and demonstrate that there is no clear pattern in either of them.
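If the retained variances are needed for further use, they can also be extracted from the model object itself (FactoMineR stores them in the eig component); a sketch:

```r
# Columns of $eig: eigenvalue, percentage of variance, cumulative percentage;
# one row per MCA dimension.
head(chai_mca$eig, 3)
```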